Abstract

We present a bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is essentially optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the $O(\sqrt{n})$ worst-case regret of Exp3 [Auer et al., 2002b] for adversarial rewards and the (poly)logarithmic regret of UCB1 [Auer et al., 2002a] for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on (non-Bayesian) multi-armed bandits. Prior work on multi-armed bandits treats them separately, and does not attempt to jointly optimize for both. Our result falls into a general theme of achieving good worst-case performance while also taking advantage of "nice" problem instances, an important issue in the design of algorithms with partially known inputs.
1 Introduction

Multi-armed bandits (henceforth, MAB) is a simple model for sequential decision making under uncertainty that captures the crucial tradeoff between exploration (acquiring new information) and exploitation (optimizing based on the information that is currently available). Introduced in the early 1950s [Robbins, 1952], it has been studied intensively since then in Operations Research, Electrical Engineering, Economics, and Computer Science.

The "basic" MAB framework can be formulated as a game between the player (i.e., the algorithm) 
and the adversary (i.e., the environment). The player selects actions ("arms") sequentially from a fixed, 
finite set of possible options, and receives rewards that correspond to the selected actions. For simplicity, 
it is customary to assume that the rewards are bounded in $[0, 1]$. In the adversarial model one makes no other restrictions on the sequence of rewards, while in the stochastic model we assume that the rewards of a given arm form an i.i.d. sequence of random variables. The performance criterion is the so-called regret, which compares the rewards received by the player to the rewards accumulated by a hypothetical benchmark algorithm. A typical, standard benchmark is the best single arm. See Figure 1 for a precise description of this framework.

Adversarial rewards and stochastic rewards are the two main reward models in the MAB literature. Both are now very well understood, in particular thanks to the seminal papers [Lai and Robbins, 1985, Auer et al., 2002a,b]. In particular, the Exp3 algorithm from [Auer et al., 2002b] attains a regret growing as $O(\sqrt{n})$ in the adversarial model, where $n$ is the number of rounds, and the UCB1 algorithm from [Auer et al., 2002a] attains $O(\log n)$ in the stochastic model. Both results are essentially optimal. It is worth noting that UCB1 and Exp3 have influenced, and to some extent inspired, a number of follow-up papers on richer MAB settings.
Known parameters: $K$ arms; $n$ rounds ($n \ge K \ge 2$).
Unknown parameters (stochastic model):
    $K$ probability distributions $\nu_1, \ldots, \nu_K$ on $[0, 1]$ with resp. means $\mu_1, \ldots, \mu_K$.

For each round $t = 1, 2, \ldots, n$:

(1) algorithm chooses $I_t \in \{1, \ldots, K\}$, possibly using external randomization;

(2) adversary simultaneously selects rewards $g_t = (g_{1,t}, \ldots, g_{K,t}) \in [0, 1]^K$.
    - in the stochastic model, each reward $g_{i,t} \sim \nu_i$ is drawn independently.

(3) the forecaster receives (and observes) the reward $g_{I_t,t}$. He does not observe the rewards from the other arms.

Goal: Minimize the regret, defined in the adversarial model by:

$$R_n = \max_{i \in \{1, \ldots, K\}} \sum_{t=1}^{n} g_{i,t} - \sum_{t=1}^{n} g_{I_t,t},$$

and in the stochastic model by:

$$\overline{R}_n = \sum_{t=1}^{n} \left( \max_{i \in \{1, \ldots, K\}} \mu_i - \mu_{I_t} \right).$$

Figure 1: The MAB framework: adversarial rewards and stochastic rewards.
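The protocol and the two regret notions of Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`play`, `adversarial_regret`, `stochastic_regret`), the zero-based arm indices, and the representation of the reward sequence as a precomputed table are ours, not part of the framework.

```python
def play(policy, rewards):
    """Run the round-by-round protocol on a fixed reward table.

    rewards[t][i] is the reward of arm i in round t+1, in [0, 1].
    policy(t, history) returns an arm index; history holds only the
    (arm, observed reward) pairs of earlier rounds (bandit feedback).
    """
    history, arms = [], []
    for t, g_t in enumerate(rewards):
        arm = policy(t, history)
        history.append((arm, g_t[arm]))  # rewards of other arms stay hidden
        arms.append(arm)
    return arms


def adversarial_regret(rewards, arms):
    """R_n: best fixed arm in hindsight minus the collected reward."""
    K = len(rewards[0])
    best = max(sum(g[i] for g in rewards) for i in range(K))
    return best - sum(g[a] for g, a in zip(rewards, arms))


def stochastic_regret(means, arms):
    """Stochastic regret: sum over rounds of (best mean - mean of chosen arm)."""
    best = max(means)
    return sum(best - means[a] for a in arms)
```

The first notion depends on the realized rewards, the second only on the means of the chosen arms, which is why the two can differ even in the stochastic model.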

However, it is easy to see that UCB1 incurs a trivial $\Omega(n)$ regret in the adversarial model, whereas Exp3 has $\Omega(\sqrt{n})$ regret even in the stochastic model.^1 This raises a natural question that we aim to resolve in this paper: can we achieve the best of both worlds? Is there a bandit algorithm which matches the performance of Exp3 in the adversarial model, and attains the performance of UCB1 if the rewards are in fact stochastic? A more specific (and slightly milder) formulation is as follows:

Is there a bandit algorithm that has $\tilde{O}(\sqrt{n})$ regret in the adversarial model and polylog(n) regret in the stochastic model?

We are not aware of any prior work on this question. Intuitively, we introduce a new tradeoff: a bandit algorithm has to balance between attacking the weak adversary (stochastic rewards) and defending itself from a more devious adversary that targets the algorithm's weaknesses, such as being too aggressive if the reward sequence is seemingly stochastic. In particular, while the basic exploration-exploitation tradeoff induces $O(\log n)$ regret in the stochastic model, and $O(\sqrt{n})$ regret in the adversarial model, it is not clear a priori what the optimal regret guarantees are for this new attack-defense tradeoff.

We answer the above question affirmatively, with a new algorithm called SAO (Stochastic and Adversarial Optimal). To formulate our result, we need to introduce some notation. In the stochastic model, let $\mu_i$ be the expected single-round reward from arm $i$. A crucial parameter is the minimal gap: $\Delta = \min_{i:\, \mu_i < \mu^*} (\mu^* - \mu_i)$, where $\mu^* = \max_i \mu_i$. With this notation, UCB1 attains regret $O(\frac{K}{\Delta} \log n)$ in the stochastic model, where $K$ is the number of arms. We are looking for the following: regret $\mathbb{E}[R_n] = \tilde{O}(\sqrt{Kn})$ in the adversarial model and regret $\mathbb{E}[\overline{R}_n] = \tilde{O}(\frac{K}{\Delta})$ in the stochastic model, where $\tilde{O}(\cdot)$ hides polylog(n) factors. Our main result is as follows.

^1 This is clearly true for the original version of Exp3 with a mixing parameter. However, this mixing is unnecessary against oblivious adversaries [Stoltz, 2005]. The regret of the resulting algorithm in the stochastic model is unknown.

Theorem 1.1. There exists an algorithm SAO for the MAB problem (Algorithm 1 on page 13) such that:

(a) in the adversarial model, SAO achieves regret $\mathbb{E}[R_n] \le O(\sqrt{nK} \log^{3/2}(n) \log K)$.

(b) in the stochastic model, SAO achieves regret $\mathbb{E}[\overline{R}_n] \le O(\frac{K}{\Delta} \log^2(n) \log K)$.

Moreover, with very little extra work we can obtain the corresponding high-probability versions (see Theorem 4.1 for a precise statement).

It is easier, and more instructive, to explain the main ideas on the special case of two arms and oblivious adversary.^2 This special case (with a simplified algorithm) is presented in Section 3. The general case is then fleshed out in Section 4.



Discussion. The question raised in this paper touches upon an important theme in Machine Learning, and 
more generally in the design of algorithms with partially known inputs: how to achieve a good worst-case 
performance and also take advantage of "nice" problem instances. In the context of MAB it is natural to 
focus on the distinction between stochastic and adversarial rewards, especially given the prominence of 
the two models in the MAB literature. Then our "best-of-both-worlds" question is the first-order specific 
question that one needs to resolve. Also, we provide the first analysis of the same MAB algorithm under 
both adversarial and stochastic rewards. 

Once the "best-of-both-worlds" question is settled, several follow-up questions emerge. Most immediately, it is not clear whether the polylog factors can be improved to match the optimal guarantees for each respective model; a lower bound would indicate that the "attack-defense" tradeoff is fundamentally different from the familiar explore-exploit tradeoffs. A natural direction for further work is rewards that are adversarial on a few short time intervals, but stochastic most of the time. Moreover, it is desirable to adapt not only to the binary distinction between the stochastic and adversarial rewards, but also to some form of continuous tradeoff between the two reward models.

Finally, we acknowledge that our solution is no more (and no less) than a theoretical proof of concept. 
More work, theoretical and experimental, and perhaps new ideas or even new algorithms, are needed for a 
practical solution. In particular, a practical algorithm should probably go beyond what we accomplish in 
this paper, along the lines of the two possible extensions mentioned above. 



Related work. The general theme of combining worst-case and optimistic performance bounds has received considerable attention in prior work on online learning. A natural incarnation of this theme in the context of MAB concerns proving upper bounds on regret that can be written in terms of some complexity measure of the rewards, and match the optimal worst-case bounds. To this end, a version of Exp3 achieves regret $\tilde{O}(\sqrt{K G^*_n})$, where $G^*_n \le n$ is the maximal cumulative reward of a single arm, and the corresponding high probability result was recently proved in Audibert and Bubeck [2010]. In Hazan and Kale [2009], the authors obtain regret $\tilde{O}(\sqrt{K V^*})$, where $V^* \le n$ is the maximal "temporal variation" of the rewards.^3 Similar results have been obtained for the full-feedback ("experts") version in [Cesa-Bianchi et al., 2007, Abernethy et al., 2008]. Also, the regret bound for UCB1 depends on the gap $\Delta$, and matches the optimal worst-case bound for the stochastic model (up to logarithmic factors). Moreover, adaptivity to "nice" problem instances is a crucial theme in the work on bandits in metric spaces [Kleinberg et al., 2008, Bubeck et al., 2011, Slivkins, 2011], an MAB setting in which some information on similarity between arms is a priori available to an algorithm.

^2 An oblivious adversary fixes the rewards $g_{i,t}$ for all rounds $t$ without observing the algorithm's choices.

^3 The result in Hazan and Kale [2009] does not shed light on the question in the present paper, because the "temporal variation" concerns actual rewards rather than expected rewards. In particular, temporal variation is minimal when the actual reward of each arm is constant over time, and (essentially) maximal in the stochastic model with 0-1 rewards.

The distinction between polylog(n) and $\Omega(\sqrt{n})$ regret has been crucial in other MAB settings: bandits with linear rewards [Dani et al., 2008], bandits in metric spaces [Kleinberg and Slivkins, 2010], and an extension of MAB to auctions [Babaioff et al., 2009, Devanur and Kakade, 2009, Babaioff et al., 2010]. Interestingly, here we have four different MAB settings (including the one in this paper) in which this distinction occurs for four different reasons, with no apparent connections.

A proper survey of the literature on multi-armed bandits is beyond the scope of this paper; a reader is encouraged to refer to Cesa-Bianchi and Lugosi [2006] for background. An important high-level distinction is between Bayesian and non-Bayesian MAB formulations. Both have a rich literature; this paper focuses on the latter. The "basic" MAB version defined in this paper has been extended in various papers to include additional information and/or assumptions about rewards.

Most relevant to this paper are algorithms UCB1 [Auer et al., 2002a] and Exp3 [Auer et al., 2002b]. UCB1 has a slightly more refined regret bound than the one that we cited earlier: $\overline{R}_n = O\left(\sum_{i:\, \mu_i < \mu^*} \frac{\log n}{\Delta_i}\right)$ with high probability, where $\Delta_i = \mu^* - \mu_i$. A matching lower bound (up to the considerations of the variance and constant factors) is proved in Lai and Robbins [1985]. Several recent papers [Auer and Ortner, 2010, Honda and Takemura, 2010, Audibert et al., 2009, Audibert and Bubeck, 2010, Maillard and Munos, 2011, Garivier and Cappé, 2011, Perchet and Rigollet, 2011] improve over UCB1, obtaining algorithms with regret bounds that are even closer to the lower bound.

The regret bound for Exp3 is $\mathbb{E}[R_n] \le O(\sqrt{nK \log K})$, and a version of Exp3 achieves this with high probability [Auer et al., 2002b]. There is a nearly matching lower bound of $\Omega(\sqrt{Kn})$. Recently Audibert and Bubeck [2010] have shaved off the $\log K$ factor, achieving an algorithm with regret $O(\sqrt{Kn})$ in the adversarial model against an oblivious adversary.



High-level ideas. For clarity, let us consider the simplified algorithm for the special case of two arms 
and oblivious adversary. The algorithm starts with the assumption that the stochastic model is true, and 
then proceeds in three phases: "exploration", "exploitation", and the "adversarial phase". In the exploration 
phase, we alternate the two arms until one of them (say, arm 1) appears significantly better than the other. 
When and if that happens, we move to the exploitation phase where we focus on arm 1, but re-sample arm 
2 with small probability. After each round we check several consistency conditions which should hold with 
high probability if the rewards are stochastic. When and if one of these conditions fails, we declare that we 
are not in the case of stochastic rewards, and switch to running a bandit algorithm for the adversarial model 
(a version of Exp3). 

Here we have an incarnation of the "attack-defense" tradeoff mentioned earlier in this section: the con- 
sistency conditions should be (a) strong enough to justify using the stochastic model as an operating assump- 
tion while the conditions hold, and (b) weak enough so that we can check them despite the low sampling 
probability of arm 2. The fact that (a) and (b) are not mutually exclusive is surprising and unexpected. 

More precisely, the consistency conditions should be strong enough to insure us against losing too much in the first two phases even if we are in the adversarial model. We use a specific re-sampling schedule for arm 2 which is rare enough so that we do not accumulate much regret if this is indeed a bad arm, and yet sufficient to check the consistency conditions.

To extend to the $K$-arm case, we "interleave" exploration and exploitation, "deactivating" arms one by one as they turn out to be suboptimal. The sampling probability of a given arm increases while the arm stays active, and then decreases after it is deactivated, with a smooth transition between the two phases. This complicated behavior (and the fact that we handle general adversaries) in turn necessitates a more delicate analysis.
2 Preliminaries

We consider randomized algorithms, in the sense that $I_t$ (the arm chosen at time $t$) is drawn from a probability distribution $p_t$ on $\{1, \ldots, K\}$. We denote by $p_{i,t}$ the probability that $I_t = i$. For brevity, let $I_{i,t} = \mathbb{1}_{\{I_t = i\}}$.

Given such a randomized algorithm, it is a well-known trick to use $\tilde{g}_{i,t} = \frac{g_{i,t} I_{i,t}}{p_{i,t}}$ as an unbiased estimate of the reward $g_{i,t}$. Now for arm $i$ and time $t$ we introduce:

• $G_{i,t} = \sum_{s=1}^{t} g_{i,s}$ (fixed-arm cumulative reward from arm $i$ up to time $t$),

• $\tilde{G}_{i,t} = \sum_{s=1}^{t} \tilde{g}_{i,s}$ (estimated cumulative reward from arm $i$ up to time $t$),

• $\hat{G}_{i,t} = \sum_{s=1}^{t} g_{i,s} I_{i,s}$ (algorithm's cumulative reward from arm $i$ up to time $t$),

• $T_i(t) = \sum_{s=1}^{t} I_{i,s}$ (the sampling time of arm $i$ up to time $t$).

• The corresponding averages: $H_{i,t} = \frac{1}{t} G_{i,t}$, $\tilde{H}_{i,t} = \frac{1}{t} \tilde{G}_{i,t}$, and $\hat{H}_{i,t} = \hat{G}_{i,t} / T_i(t)$.

$G_{i,t}$ is the cumulative reward of a "fixed-arm algorithm" that always plays arm $i$. Recall that our benchmarks are $\max_i G_{i,n}$ for the adversarial model, and $\max_i \mathbb{E}[G_{i,n}]$ for the stochastic model.

Note that $\tilde{H}_{i,t}$, $\hat{H}_{i,t}$ (and $\tilde{G}_{i,t}$, $\hat{G}_{i,t}$) are observed by an algorithm whereas $H_{i,t}$ (and $G_{i,t}$) is not. Informally, $\tilde{H}_{i,t}$ and $\hat{H}_{i,t}$ are estimates for the expected reward $\mu_i$ in the stochastic model, and $\tilde{H}_{i,t}$ is an estimate for the benchmark reward $H_{i,t}$ in the adversarial model.
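As an illustration of these definitions, the following Python sketch computes the cumulative quantities and their averages from a reward table, the sequence of chosen arms, and the sampling probabilities. The names (`estimators`, `averages`, `G_tilde` for the importance-weighted sums, `G_hat` for the rewards actually collected) and the zero-based indexing are our own choices, and the sketch assumes every chosen arm had positive probability.

```python
def estimators(rewards, arms, probs):
    """Return (G, G_tilde, G_hat, T) per arm after the last round.

    rewards[s][i]: reward of arm i in round s; arms[s]: chosen arm;
    probs[s][i]: probability with which arm i was chosen in round s.
    """
    K = len(rewards[0])
    G = [0.0] * K        # fixed-arm cumulative reward (not observable)
    G_tilde = [0.0] * K  # importance-weighted estimate of G
    G_hat = [0.0] * K    # reward actually collected from each arm
    T = [0] * K          # number of times each arm was played
    for g, a, p in zip(rewards, arms, probs):
        for i in range(K):
            G[i] += g[i]
        G_tilde[a] += g[a] / p[a]  # unbiased: E[g * 1{I=a} / p] = g
        G_hat[a] += g[a]
        T[a] += 1
    return G, G_tilde, G_hat, T


def averages(G, G_tilde, G_hat, T):
    """Averages: H and H_tilde divide by t, H_hat divides by the pull count."""
    t = sum(T)
    H = [x / t for x in G]
    H_tilde = [x / t for x in G_tilde]
    H_hat = [x / n_i if n_i else 0.0 for x, n_i in zip(G_hat, T)]
    return H, H_tilde, H_hat
```

Only `G_tilde`, `G_hat` and `T` could be maintained by an actual algorithm; `G` is included here solely to make comparisons in simulations possible.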

In the stochastic model we define the gap of arm $i$ as $\Delta_i = (\max_{1 \le j \le K} \mu_j) - \mu_i$, and the minimal gap $\Delta = \min_{i:\, \Delta_i > 0} \Delta_i$.

Following the literature, we measure algorithm's performance in terms of regret $R_n$ and $\overline{R}_n$ as defined in Figure 1. The two notions of regret are somewhat different, in particular the "stochastic regret" $\overline{R}_n$ is not exactly equal to the expected "adversarial regret" $R_n$. However, in the stochastic model they are approximately equal:^4

$$\mathbb{E}[\overline{R}_n] \le \mathbb{E}[R_n] \le \mathbb{E}[\overline{R}_n] + \sqrt{\tfrac{1}{2}\, n \log K}.$$



3 A simplified SAO algorithm for K = 2 arms 

We will derive a (slightly weaker version of) the main result for the special case of K = 2 arms and oblivious 
adversary, using a simplified algorithm. This version contains most of the ideas from the general case, but 
can be presented in a more lucid fashion. 

We are looking for the "best-of-both-worlds" feature: $\tilde{O}(\sqrt{n})$ regret in the adversarial model, and $\tilde{O}(\frac{1}{\Delta})$ regret in the stochastic model, where $\Delta = |\mu_1 - \mu_2|$ is the gap. Our goal in this section is to obtain this feature in the simplest way possible. In particular, we will hide the constants under the $O(\cdot)$ notation, and will not attempt to optimize the polylog(n) factors; also, we will assume oblivious adversary. We will prove the following theorem:

Theorem 3.1. Consider a MAB problem with two arms. There exists an algorithm such that:

(a) against an oblivious adversary, its expected regret is $\mathbb{E}[R_n] \le O(\sqrt{n} \log^2 n)$.

(b) in the stochastic model, its expected regret satisfies $\mathbb{E}[\overline{R}_n] \le O(\frac{1}{\Delta} \log^3 n)$.
Both regret bounds also hold with probability at least $1 - \frac{1}{n}$.

Note that in the stochastic model, regret trivially cannot be larger than $\Delta n$, so part (b) trivially implies regret $\mathbb{E}[\overline{R}_n] \le \tilde{O}(\sqrt{n})$.

^4 This fact is well-known and easy to prove, e.g. see Proposition 34 in Audibert and Bubeck [2010].

Our analysis proceeds via high-probability arguments and directly obtains the high-probability guarantees. The high-probability arguments tend to make the analysis cleaner; we suspect it cannot be made much simpler if we only seek bounds on expected regret.



3.1 A simplified SAO (Stochastic and Adversarial Optimal) Algorithm 

The algorithm proceeds in three phases: exploration, exploitation, and adversarial phase. In the exploration phase, we alternate the two arms until one of them appears significantly better than the other. In the exploitation phase, we focus on the better arm, but re-sample the other arm with small probability. We check several consistency conditions which should hold with high probability if the rewards are stochastic. When and if one of these conditions fails, we declare that we are not in the case of stochastic rewards, and switch to running a bandit algorithm for the adversarial model, namely algorithm Exp3.P [Auer et al., 2002b].

The algorithm is parameterized by $C_{crn} = O(\log n)$ which we will choose later in Section 3.2. The formal description of the three phases is as follows.

(Exploration phase) In each round $t$, pick an arm at random: $p_{1,t} = p_{2,t} = \frac{1}{2}$. Go to the next phase as soon as $t > \Omega(C_{crn}^2)$ and the following condition fails:

$$|\tilde{H}_{1,t} - \tilde{H}_{2,t}| < 24\, C_{crn} / \sqrt{t}. \tag{1}$$

Let $\tau_*$ be the duration of this phase. Without loss of generality, assume $\tilde{H}_{1,\tau_*} > \tilde{H}_{2,\tau_*}$. This means, informally, that arm 1 is selected for exploitation.

(Exploitation phase) In each round $t > \tau_*$, pick arm 2 with probability $p_{2,t} = \frac{\tau_*}{2t}$, and arm 1 with the remaining probability $p_{1,t} = 1 - \frac{\tau_*}{2t}$.

After the round, check the following consistency conditions:

$$8\, C_{crn} / \sqrt{\tau_*} < \tilde{H}_{1,t} - \tilde{H}_{2,t} < 40\, C_{crn} / \sqrt{\tau_*}, \tag{2}$$

$$\begin{cases} |\tilde{H}_{1,t} - \hat{H}_{1,t}| < 6\, C_{crn} / \sqrt{t}, \\ |\tilde{H}_{2,t} - \hat{H}_{2,t}| < 6\, C_{crn} / \sqrt{\tau_*}. \end{cases} \tag{3}$$

If one of these conditions fails, go to the next phase.

(Adversarial phase) Run algorithm Exp3.P from Auer et al. [2002b].
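The three phases can be sketched in Python as follows. This is a minimal illustration rather than the algorithm as analyzed: the constants 24, 8, 40 and 6 are taken from Conditions (1)-(3), the default minimum exploration length `c_crn ** 2` assumes a hidden constant of 1 in $\Omega(C_{crn}^2)$, arms are indexed 0 and 1, and the adversarial phase is delegated to a caller-supplied `fallback` function, a hypothetical placeholder for Exp3.P, which is not implemented here.

```python
import math
import random


def simplified_sao(rewards, c_crn, fallback, rng, min_explore=None):
    """Sketch of the three-phase scheme for two arms on a fixed reward table.

    rewards[t - 1] = (g_1, g_2) for round t; fallback(t) returns an arm and
    stands in for Exp3.P. Returns (arms played, final phase, tau_star).
    """
    if min_explore is None:
        min_explore = c_crn ** 2  # assumption: constant 1 in Omega(C_crn^2)
    g_tilde = [0.0, 0.0]  # importance-weighted cumulative rewards
    g_hat = [0.0, 0.0]    # collected cumulative rewards
    pulls = [0, 0]
    phase, tau, best = "explore", None, None
    arms = []
    for t in range(1, len(rewards) + 1):
        if phase == "adversarial":
            arms.append(fallback(t))
            continue
        if phase == "explore":
            p = [0.5, 0.5]
        else:
            p = [0.0, 0.0]
            p[1 - best] = tau / (2.0 * t)   # rare re-sampling of the other arm
            p[best] = 1.0 - p[1 - best]
        arm = 0 if rng.random() < p[0] else 1
        g = rewards[t - 1][arm]
        g_tilde[arm] += g / p[arm]
        g_hat[arm] += g
        pulls[arm] += 1
        arms.append(arm)
        h_tilde = [x / t for x in g_tilde]
        if phase == "explore":
            # Leave exploration once t is large enough and Condition (1) fails.
            if t > min_explore and abs(h_tilde[0] - h_tilde[1]) >= 24 * c_crn / math.sqrt(t):
                phase, tau = "exploit", t
                best = 0 if h_tilde[0] > h_tilde[1] else 1
        else:
            h_hat = [g_hat[i] / pulls[i] if pulls[i] else 0.0 for i in range(2)]
            other = 1 - best
            gap = h_tilde[best] - h_tilde[other]
            cond2 = 8 * c_crn / math.sqrt(tau) < gap < 40 * c_crn / math.sqrt(tau)
            cond3 = (abs(h_tilde[best] - h_hat[best]) < 6 * c_crn / math.sqrt(t)
                     and abs(h_tilde[other] - h_hat[other]) < 6 * c_crn / math.sqrt(tau))
            if not (cond2 and cond3):
                phase = "adversarial"   # stop trusting the stochastic model
    return arms, phase, tau
```

With $C_{crn} = 12 \ln n$ the thresholds are far too conservative to be triggered on small simulated horizons, so experiments with this sketch would typically pass a much smaller `c_crn`; that is an experimental convenience and carries none of the guarantees.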



Discussion. The exploration phase is simple: Condition (1) is chosen so that once it fails then (assuming stochastic rewards) the seemingly better arm is indeed the best arm with high probability.

In the exploitation phase, we define the re-sampling schedule for arm 2 and a collection of "consistency conditions". The re-sampling schedule should be sufficiently rare to avoid accumulating much regret if arm 2 is indeed a bad arm. The consistency conditions should be sufficiently strong to justify using the stochastic model as an operating assumption while they hold. Namely, an adversary constrained by these conditions should not be able to inflict too much regret on our algorithm in the first two phases. Yet, the consistency conditions should be weak enough so that they hold with high probability in the stochastic model, despite the low sampling probability of arm 2.

It is essential that we use both $\hat{H}_{i,t}$ and $\tilde{H}_{i,t}$ in the consistency conditions: the interplay of these two estimators allows us to bound regret in the adversarial model. Other than that, the conditions that we use are fairly natural (the surprising part is that they work). Condition (2) checks whether the relation between the two arms is consistent with the outcome of the exploration phase, i.e. whether arm 1 still seems better than arm 2, but not too much better. Condition (3) checks whether for each arm $i$, the estimate $\tilde{H}_{i,t}$ is close to the average $\hat{H}_{i,t}$. In the stochastic model, both estimate the expected gain $\mu_i$, so we expect them to be not too far apart. However, our definition of "too far" should be consistent with how often a given arm is sampled.
3.2 Concentration inequalities 

The "probabilistic" aspect of the analysis is confined to proving that several properties of estimates and 
sampling times hold with high probability. The rest of the analysis can proceed as if these properties hold 
with probability 1. In particular, we have made our core argument essentially deterministic, which greatly 
simplifies presentation. 

All high-probability results are obtained using an elementary concentration inequality loosely known as Chernoff Bounds. For the sake of simplicity, we use a slightly weaker formulation below (see Appendix A for a proof), which uses just one inequality for all cases.

Theorem 3.2 (Chernoff Bounds). Let $X_t$, $t \in [n]$, be independent random variables such that $X_t \in [0, 1]$ for each $t$. Let $X = \sum_{t=1}^{n} X_t$ be their sum, and let $\mu = \mathbb{E}[X]$. Then

$$\Pr\left[\, |X - \mu| > C \max(1, \sqrt{\mu}) \,\right] < 2 e^{-C/3}, \quad \text{for any } C > 1. \tag{4}$$

We will often need to apply Chernoff Bounds to sums whose summands depend on some events in the execution of the algorithms and therefore are not mutually independent. However, in all cases these issues are but a minor technical obstacle which can be side-stepped using a slightly more careful setup.^5 In particular, we sometimes find it useful to work in the probability space obtained by conditioning on the outcome of the exploration phase. Specifically, the post-exploration probability space is the probability space obtained by conditioning on the following events: that the exploration phase ends, that it has a specific duration $\tau_*$, and that arm 1 is chosen for exploitation.

Throughout the analysis, we will obtain concentration bounds that hold with probability at least $1 - 2n^{-4}$. We will often take a Union Bound over all rounds $t$, which will imply success probability at least $1 - 2n^{-3}$. To simplify presentation, we will allow a slight abuse of notation: we will say with high probability (abbreviated w.h.p.), which will mean with probability at least $1 - 2n^{-3}$ or at least $1 - 2n^{-4}$, depending on the context.

To parameterize the algorithm, let us fix $C_{crn} = 12 \ln(n)$, so that Theorem 3.2 with $C = C_{crn}$ ensures success probability at least $1 - 2n^{-4}$.
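The arithmetic behind this choice is that $2e^{-C/3}$ with $C = 12 \ln n$ equals $2n^{-4}$. A small Python sketch makes the check explicit; the function names are ours and the code only evaluates the two sides of inequality (4), it does not prove it.

```python
import math


def chernoff_failure_bound(C):
    """Right-hand side of (4): 2 * exp(-C / 3)."""
    return 2.0 * math.exp(-C / 3.0)


def c_crn(n):
    """The parameter choice C_crn = 12 ln(n)."""
    return 12.0 * math.log(n)


def deviation_radius(C, mu):
    """Deviation allowed inside the probability in (4): C * max(1, sqrt(mu))."""
    return C * max(1.0, math.sqrt(mu))
```

For example, `chernoff_failure_bound(c_crn(n))` agrees with `2 * n ** -4` up to floating-point error for any horizon `n > 1`.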

3.3 Analysis: adversarial model 

We need to analyze our algorithm in two different reward models. We start with the adversarial model, so 
that we can re-use some of the claims proved here to analyze the stochastic model. 

Recall that $\tau_*$ denotes the duration of the exploration phase (which in general is a random variable). We follow the convention from Section 3.1 that whenever the exploration phase ends, the arm chosen for exploitation is arm 1. (Note that we do not assume that arm 1 is the best arm.)

We start the analysis by showing that the re-sampling schedule in the exploitation phase does not result in playing arm 2 too often.

Claim 3.3. During the exploitation phase, arm 2 is played at most $O(\tau_* \log n)$ times w.h.p.

^5 However, the independence issues appear prohibitive for $K > 2$ arms or if we consider a non-oblivious adversary. So for the general case we resorted to a more complicated analysis via martingale inequalities.
Proof. We will work in the post-exploration probability space $S$. We need to bound from above the sum $\sum_t I_{2,t}$, where $t$ ranges over the exploitation phase. However, Chernoff Bounds do not immediately apply since the number of summands itself is a random variable. Further, if we condition on a specific duration of exploitation then we break independence between summands. We sidestep this issue by considering an alternative algorithm in which exploitation lasts indefinitely (i.e., without the stopping conditions), and which uses the same randomness as the original algorithm. It suffices to bound from above the number of times that arm 2 is played during the exploitation phase in this alternative algorithm; denote this number by $N$. Letting $J_t$ be the arm selected in round $t$ of the alternative algorithm, we have that $N = \sum_{t=\tau_*+1}^{n} \mathbb{1}_{\{J_t = 2\}}$ is a sum of 0-1 random variables, and in $S$ these variables are independent. Moreover, in $S$ it holds that

$$\mathbb{E}[N] = \sum_{t=\tau_*+1}^{n} p_{2,t} = \tau_* \sum_{t=\tau_*+1}^{n} \frac{1}{2t} = O(\tau_* \log n).$$

Therefore, the claim follows from Chernoff Bounds. □
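The expectation in this proof is a truncated harmonic sum, and comparing it with an integral gives the explicit bound $\mathbb{E}[N] \le \frac{\tau_*}{2} \ln(n/\tau_*)$. The sketch below evaluates the sum directly so the bound can be checked numerically; `expected_resamples` is an illustrative name of ours, and the re-sampling probability $\tau_*/(2t)$ is the one from the exploitation phase.

```python
import math


def expected_resamples(tau_star, n):
    """E[N] = sum over t = tau_star + 1, ..., n of tau_star / (2 t)."""
    return sum(tau_star / (2.0 * t) for t in range(tau_star + 1, n + 1))


def log_upper_bound(tau_star, n):
    """Integral comparison: the sum of 1/t over (tau_star, n] is below ln(n / tau_star)."""
    return 0.5 * tau_star * math.log(n / tau_star)
```

For instance, `expected_resamples(100, 10**5)` stays below `log_upper_bound(100, 10**5)`, which itself is at most `0.5 * 100 * math.log(10**5)`.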

Now we connect the estimated cumulative rewards $\tilde{G}_{i,t}$ with the benchmark $G_{i,t}$. More specifically, we will bound from above several expressions of the form $|\tilde{H}_{i,t} - H_{i,t}|$. Naturally, the upper bound for arm 1 will be stronger since this arm is played more often during exploitation. To ensure that the bound for arm 2 is strong enough we need to play this arm "sufficiently often" during exploitation. (Whereas Claim 3.3 ensures that we do not play it "too often".) Here and elsewhere in this analysis, we find it more elegant to express some of the claims in terms of the average cumulative rewards (such as $H_{i,t}$, etc.).

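Claim 3.4.

(a) With high probability, $|\tilde{H}_{i,\tau_*} - H_{i,\tau_*}| < 2\, C_{crn} / \sqrt{\tau_*}$ for each arm $i$.

(b) For any round $t$ in the exploitation phase, with high probability it holds that

$$\begin{aligned} |\tilde{H}_{1,t} - H_{1,t}| &< 3\, C_{crn} / \sqrt{t}, \\ |\tilde{H}_{2,t} - H_{2,t}| &< 3\, C_{crn} / \sqrt{\tau_*}. \end{aligned} \tag{5}$$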

Proof. For part (a), we are interested in the sum $\sum_{t \le \tau_*} g_{i,t} I_{i,t}$. As in the proof of Claim 3.3, Chernoff Bounds do not immediately apply since the number of summands $\tau_*$ is a random variable (and conditioning on a particular value of $\tau_*$ tampers with independence between summands). So let us consider an alternative algorithm in which the exploration phase proceeds indefinitely, without the stopping condition, and uses the same randomness as the original algorithm.^6 Let $J_t$ be the arm selected in round $t$ of this alternative algorithm, and define $A_{i,t} = \sum_{s=1}^{t} g_{i,s} \mathbb{1}_{\{J_s = i\}}$. Then (when run on the same problem instance) both algorithms coincide for any $t \le \tau_*$, so in particular $\tilde{G}_{i,t} = 2 A_{i,t}$. Now, $A_{i,t}$ is the sum of bounded independent random variables with expectation $G_{i,t}/2$. Therefore by Chernoff Bounds w.h.p. it holds that $|A_{i,t} - G_{i,t}/2| < C_{crn} \sqrt{t}$ for each $t$, which implies the claim.

For part (b), we will analyze the exploitation phase separately. Let us work in the post-exploration probability space $S$. We will consider the alternative algorithm from the proof of Claim 3.3 (in which exploitation continues indefinitely). This way we do not need to worry that we implicitly condition on the event that a particular round $t > \tau_*$ belongs to the exploitation phase. Clearly, it suffices to prove (5) for this alternative algorithm. To facilitate the notation, define the time interval $\mathrm{INT} = \{\tau_* + 1, \ldots, t\}$, and denote $G_{i,\mathrm{INT}} = \sum_{s \in \mathrm{INT}} g_{i,s}$ and $\tilde{G}_{i,\mathrm{INT}} = \sum_{s \in \mathrm{INT}} \tilde{g}_{i,s}$.

To handle arm 1, note that in $S$, $\tilde{G}_{1,\mathrm{INT}}$ is a sum of independent random variables, with expectation $G_{1,\mathrm{INT}}$. Since $p_{1,s} \ge \frac{1}{2}$ for any $s \in \mathrm{INT}$, the summands $\tilde{g}_{1,s}$ are bounded by 2. Therefore by Chernoff Bounds with high probability it holds that

$$|\tilde{G}_{1,\mathrm{INT}} - G_{1,\mathrm{INT}}| < 2\, C_{crn} \sqrt{t - \tau_*}.$$

From this and part (a) it follows that w.h.p.

$$|\tilde{G}_{1,t} - G_{1,t}| < 2\, C_{crn} \left( \sqrt{\tau_*} + \sqrt{t - \tau_*} \right) \le 3\, C_{crn} \sqrt{t},$$

which implies the claim for arm 1.

^6 Note that this is not the same "alternative algorithm" as the one in the proof of Claim 3.3.

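Handling arm 2 requires a little more work since the summands $\tilde{g}_{2,s}$ may be large (since they have a small probability $p_{2,s}$ in the denominator). For each $s \in \mathrm{INT}$ we have $p_{2,s} = \frac{\tau_*}{2s}$, so

$$\tilde{G}_{2,\mathrm{INT}} = \sum_{s \in \mathrm{INT}} \frac{1}{p_{2,s}}\, g_{2,s} I_{2,s} = \sum_{s \in \mathrm{INT}} \frac{2s}{\tau_*}\, g_{2,s} I_{2,s} = \frac{2t}{\tau_*} \sum_{s \in \mathrm{INT}} X_s,$$

where $X_s = \frac{s}{t}\, g_{2,s} I_{2,s} \in [0, 1]$. In $S$, random variables $X_s$, $s \in \mathrm{INT}$, are mutually independent, and the expectation of their sum is

$$\mu = \mathbb{E}\Big[ \sum_{s \in \mathrm{INT}} X_s \Big] = \frac{\tau_*}{2t}\, \mathbb{E}\big[ \tilde{G}_{2,\mathrm{INT}} \big] = \frac{\tau_*}{2t}\, G_{2,\mathrm{INT}}.$$

Noting that $G_{2,\mathrm{INT}} \le t - \tau_*$ and letting $\alpha = \frac{\tau_*}{t}$, we obtain $\mu \le \frac{\tau_*}{2} (1 - \alpha)$. By Chernoff Bounds w.h.p. it holds that

$$\Big| \sum_{s \in \mathrm{INT}} X_s - \mu \Big| < C_{crn} \sqrt{\tau_* (1 - \alpha)}.$$

Going back to $\tilde{G}_{2,\mathrm{INT}}$ and $G_{2,\mathrm{INT}}$, we obtain:

$$|\tilde{G}_{2,\mathrm{INT}} - G_{2,\mathrm{INT}}| < \frac{2t}{\tau_*}\, C_{crn} \sqrt{\tau_* (1 - \alpha)} = C_{crn}\, \frac{2t}{\sqrt{\tau_*}} \sqrt{1 - \alpha}.$$

From part (a), we have that $|\tilde{G}_{2,\tau_*} - G_{2,\tau_*}| < 2\, C_{crn} \sqrt{\tau_*} = C_{crn}\, \frac{2t}{\sqrt{\tau_*}}\, \alpha \le C_{crn}\, \frac{2t}{\sqrt{\tau_*}} \sqrt{\alpha}$, where the last step uses $\alpha \le 1$. Therefore,

$$|\tilde{G}_{2,t} - G_{2,t}| < C_{crn}\, \frac{2t}{\sqrt{\tau_*}} \left( \sqrt{\alpha} + \sqrt{1 - \alpha} \right) \le C_{crn}\, \frac{3t}{\sqrt{\tau_*}}. \qquad □$$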
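Both final steps in this proof (for arm 1 and for arm 2) rest on the elementary fact that $2(\sqrt{\alpha} + \sqrt{1 - \alpha}) \le 3$ for every $\alpha \in [0, 1]$; the maximum is $2\sqrt{2} \approx 2.83$, attained at $\alpha = \frac{1}{2}$. The following sketch checks this numerically on a grid. The function names are ours, and a grid check is of course an illustration, not a proof.

```python
import math


def combine_factor(alpha):
    """2 * (sqrt(alpha) + sqrt(1 - alpha)) for alpha = tau_star / t in [0, 1]."""
    return 2.0 * (math.sqrt(alpha) + math.sqrt(1.0 - alpha))


def worst_case_factor(grid=10_000):
    """Maximum of combine_factor over an evenly spaced grid on [0, 1]."""
    return max(combine_factor(k / grid) for k in range(grid + 1))
```

Since the worst case stays below 3, the constant 3 in Claim 3.4(b) leaves a small amount of slack over the constant 2 in part (a).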

Combining Claim 3.4(b) and Condition (2), we obtain:

Corollary 3.5. In the exploitation phase, for any round $t$ (except possibly the very last round in the phase) it holds w.h.p. that $G_{1,t} > G_{2,t}$.

By Corollary 3.5, regret accumulated by round $t$ in the exploitation phase is, with high probability, equal to $G_{1,t} - \hat{G}_{1,t} - \hat{G}_{2,t}$. The following claim upper-bounds this quantity by $O(\sqrt{t} \log^2 n)$. The proof of this claim contains our main regret computation.

Claim 3.6. For any round $t$ in the exploitation phase it holds w.h.p. that

$$\hat{G}_{1,t} + \hat{G}_{2,t} - G_{1,t} \ge -O(\sqrt{t} \log^2 n).$$



Proof. Throughout this proof, let us assume that the high-probability events in Claim 3.3 and Claim 3.4(b) actually hold; we will omit "with high probability" from here on.

Let $t$ be some (but not the last) round in the exploitation phase. First,

$$\hat{H}_{1,t} - H_{1,t} = \big[ \tilde{H}_{1,t} - H_{1,t} \big] + \big[ \hat{H}_{1,t} - \tilde{H}_{1,t} \big] \ge -O(C_{crn} / \sqrt{t}). \tag{6}$$

We have bounded the two square brackets in (6) using, respectively, Claim 3.4(b) and Condition (3). We proved that algorithm's average for arm 1 ($\hat{H}_{1,t}$) is not too small compared to the corresponding benchmark average $H_{1,t}$, and we used the estimate $\tilde{H}_{1,t}$ as an intermediary in the proof.
Similarly, using Condition (2), Condition (3), and Claim 3.4(b) to bound the three square brackets in the next equation, we obtain that

$$\hat{H}_{2,t} - H_{1,t} = \big[ \tilde{H}_{2,t} - \tilde{H}_{1,t} \big] + \big[ \hat{H}_{2,t} - \tilde{H}_{2,t} \big] + \big[ \tilde{H}_{1,t} - H_{1,t} \big] \ge -O(C_{crn} / \sqrt{\tau_*}). \tag{7}$$

Here we have proved that the algorithm did not do too badly playing arm 2, even though this arm was supposed to be suboptimal. Specifically, we establish that algorithm's average for arm 2 ($\hat{H}_{2,t}$) is not too small compared to the benchmark average for arm 1 ($H_{1,t}$). Again, the estimates $\tilde{H}_{1,t}$ and $\tilde{H}_{2,t}$ served us as intermediaries in the proof.

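Finally, let us go from bounds on average rewards to bounds on cumulative rewards (and prove the claim). Combining (6), (7) and Claim 3.3 we have:

$$\begin{aligned} \hat{G}_{1,t} + \hat{G}_{2,t} - G_{1,t} &= T_1(t) \big( \hat{H}_{1,t} - H_{1,t} \big) + T_2(t) \big( \hat{H}_{2,t} - H_{1,t} \big) \\ &\ge -O(C_{crn}) \left[ T_1(t) / \sqrt{t} + T_2(t) / \sqrt{\tau_*} \right] \\ &\ge -O(C_{crn}) \left( \sqrt{t} + \sqrt{\tau_*} \log n \right) \\ &\ge -O(\sqrt{t} \log^2 n). \end{aligned}$$

□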
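The passage from averages to cumulative rewards uses the identity $\hat{G}_{1,t} + \hat{G}_{2,t} - G_{1,t} = T_1(t)(\hat{H}_{1,t} - H_{1,t}) + T_2(t)(\hat{H}_{2,t} - H_{1,t})$, which holds because $T_1(t) + T_2(t) = t$. The sketch below computes both sides from given totals so the identity can be checked on concrete numbers; `split_by_arm` is a name of ours, arms are zero-indexed, and both pull counts are assumed positive.

```python
def split_by_arm(G1_t, G_hat, T):
    """Return (lhs, rhs) of the per-arm decomposition of collected reward minus benchmark.

    G1_t: fixed-arm cumulative reward of the exploited arm up to round t.
    G_hat[i], T[i]: reward collected from arm i and its pull count (T[i] > 0).
    """
    t = T[0] + T[1]
    H1 = G1_t / t                                   # benchmark average of the exploited arm
    H_hat = [G_hat[i] / T[i] for i in range(2)]     # algorithm's per-arm averages
    lhs = G_hat[0] + G_hat[1] - G1_t
    rhs = T[0] * (H_hat[0] - H1) + T[1] * (H_hat[1] - H1)
    return lhs, rhs
```

The decomposition shows why the two arms need different error scales: the first term is weighted by up to $t$ pulls, the second only by the much smaller pull count of the re-sampled arm.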

Now we are ready for the final computations. We will need to consider three cases, depending on which 
phase the algorithm is in when it halts (i.e., reaches the time horizon). 

First, if the exploration phase never ends then by Claim 3.4(a) w.h.p. it holds that $|\tilde H_{i,n} - H_{i,n}| \le 2 C_{crn}/\sqrt{n}$ for each arm $i$, and the exit condition (2) never fails. This implies the claimed regret bound $R_n \le O(\sqrt{n}\,\log n)$.

From here on let us assume that the exploration phase ends at some $\tau_* < n$. Define regret on the time interval $[a, b]$ as

\[ R_{[a,b]} = \max_{i} \sum_{t=a}^{b} g_{i,t} - \sum_{t=a}^{b} g_{I_t,t}. \]
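The interval regret just defined can be computed directly from a table of rewards. The following is a minimal Python sketch under stated assumptions: the names `interval_regret`, `rewards` and `played` are illustrative and not from the paper, arms are 0-indexed, and rounds are 1-indexed as in the text. It also checks, on a toy instance, the decomposition used later in the proof: because the best arm on the whole horizon need not be the best arm on each piece, $R_{[1,n]} \le R_{[1,t-1]} + R_{[t,n]}$.

```python
from typing import Sequence


def interval_regret(rewards: Sequence[Sequence[float]],
                    played: Sequence[int], a: int, b: int) -> float:
    """Regret on rounds a..b (1-indexed, inclusive).

    rewards[t][i] is the reward of arm i in round t+1 (hypothetical layout);
    played[t] is the arm chosen by the algorithm in round t+1.
    """
    if not (1 <= a <= b <= len(played)):
        raise ValueError("need 1 <= a <= b <= number of rounds")
    rounds = range(a - 1, b)
    num_arms = len(rewards[0])
    # Cumulative reward of the best fixed arm on the interval.
    best_fixed = max(sum(rewards[t][i] for t in rounds) for i in range(num_arms))
    # Cumulative reward actually collected by the algorithm on the interval.
    collected = sum(rewards[t][played[t]] for t in rounds)
    return best_fixed - collected


# Toy instance with two arms and four rounds (made-up numbers).
rewards = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
played = [1, 1, 0, 0]

whole = interval_regret(rewards, played, 1, 4)
first = interval_regret(rewards, played, 1, 1)
rest = interval_regret(rewards, played, 2, 4)
```

On this instance the whole-horizon regret equals the sum of the two pieces; in general the sum of the pieces is only an upper bound.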
Let $t$ be the last round in the exploitation phase. By the preceding corollary and Claim 3.3 we have

\[ R_{[1,t-1]} = G_{1,t} - \hat G_{1,t} - \hat G_{2,t} \le O(\sqrt{t}\,\log^2 n). \]

Second, if $t = n$ (i.e., the algorithm halts during exploitation) then we are done.

Third, if the algorithm enters the adversarial phase then we can use the regret bound for Exp3.P in Auer et al. [2002b], which states that w.h.p. $R_{[t,n]} \le O(\sqrt{n})$. Therefore

\[ R_n \le R_{[1,t-1]} + R_{[t,n]} \le O(\sqrt{n}\,\log^2 n). \]

This completes the proof of Theorem 3.1(a).



3.4 Analysis: stochastic model 

We start with a simple claim that w.h.p. each arm is played sufficiently often during exploration, and arm 1 is played sufficiently often during exploitation. This claim complements Claim 3.3 (which we will also re-use), which states that arm 2 is not played too often during exploitation.

Claim 3.7. With high probability it holds that: 

(a) during the exploration phase, each arm is played at least $\tau_*/4$ times.

(b) during the exploitation phase, $T_1(t) \ge t/4$ for each time $t$.

Proof. Both parts follow from Chernoff Bounds. The only subtlety is to ensure that we do not condition the summands (in the sum that we apply the Chernoff Bounds to) on a particular value of $\tau_*$ or on the fact that arm 1 is chosen for exploitation.

For part (a), without loss of generality assume that $n$ fair coins are tossed in advance, so that in the $t$-th round of exploration we use the $t$-th coin toss to decide which arm is chosen. Then by Chernoff Bounds for each $t$ w.h.p. it holds that among the first $t$ coin tosses there are at least $t/2 - C_{crn}\sqrt{t/2}$ heads and at least this many tails. We take the Union Bound over all $t$, so in particular this holds for $t = \tau_*$. Therefore w.h.p. we have:

\[ T_i(\tau_*) \ge \tau_*/2 - C_{crn}\sqrt{\tau_*/2}. \tag{8} \]

The claim follows from (8) because we force exploration to last for at least $\Omega(C_{crn}^2)$ rounds.

For part (b), let us analyze the exploitation phase separately. We are interested in the sum $\sum_s I_{1,s}$, where $s$ ranges over all rounds in the exploitation phase and $I_{1,s}$ indicates that arm 1 is played in round $s$. We will work in the post-exploration probability space. The indicator variables $I_{1,s}$, for all rounds $s$ during exploitation, are mutually independent. Therefore Chernoff Bounds apply, and w.h.p.

\[ T_1(t) - T_1(\tau_*) \ge (t - \tau_*)/2 - C_{crn}\sqrt{(t - \tau_*)/2}. \]

Using (8), it follows that $T_1(t) \ge t/2 - C_{crn}\left( \sqrt{\tau_*/2} + \sqrt{(t - \tau_*)/2} \right) \ge t/2 - C_{crn}\sqrt{t} \ge t/4$. □
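To make the last step of part (a) concrete, one can ask how long exploration must last before the Chernoff lower bound $t/2 - C\sqrt{t/2}$ reaches $t/4$. Rearranging gives $t \ge 8C^2$, which is one way to read the $\Omega(C_{crn}^2)$ requirement. The Python sketch below checks this arithmetic; the function names are illustrative, and the constant 8 is a consequence of this rearrangement rather than a value stated in the paper.

```python
import math


def chernoff_count_lower_bound(t: int, c: float) -> float:
    """Lower bound t/2 - c*sqrt(t/2) on the number of heads among t fair coins."""
    return t / 2 - c * math.sqrt(t / 2)


def min_rounds_for_quarter(c: float) -> int:
    """Smallest integer t with t/2 - c*sqrt(t/2) >= t/4.

    Rearranging: t/4 >= c*sqrt(t/2)  <=>  t >= 8*c^2.
    """
    return math.ceil(8 * c * c)


c_example = 3.0                                   # stand-in for C_crn
t_min = min_rounds_for_quarter(c_example)
bound_at_t_min = chernoff_count_lower_bound(t_min, c_example)
```

With `c_example = 3.0` the threshold is 72 rounds, at which point the bound equals exactly 72/4 = 18 plays per arm.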

Recall that Claim 3.4(b) connects the algorithm's estimate $\tilde H_{i,t}$ and the benchmark average $H_{i,t}$ (we will re-use this claim later in the proofs). In the stochastic model these two quantities, as well as the algorithm's average $\hat H_{i,t}$, are close to the respective expected reward $\mu_i$. The following claim makes this connection precise.

Claim 3.8. Assume the stochastic model. Then during the exploitation phase for each arm $i$ and each time $t$ the following holds with high probability:

\[ |H_{i,t} - \mu_i| \le C_{crn}/\sqrt{\tau_*}, \qquad |\hat H_{1,t} - \mu_1| \le 2 C_{crn}/\sqrt{t}, \qquad |\hat H_{2,t} - \mu_2| \le 2 C_{crn}/\sqrt{\tau_*}. \]

Proof. All three inequalities follow from Chernoff Bounds. The first inequality follows immediately. To obtain the other two inequalities, we claim that w.h.p. it holds that

\[ |\hat H_{i,t} - \mu_i| \le C_{crn}/\sqrt{T_i(t)}. \tag{9} \]

Indeed, note that without loss of generality T independent samples from the reward distribution of arm $i$ are drawn in advance, and then the reward from the $\ell$-th play of arm $i$ is the $\ell$-th sample. Then by Chernoff Bounds the bound (9) holds w.h.p. for each $T_i(t) = \ell$, and then one can take the Union Bound over all $\ell$ to obtain (9). Claim proved.

Finally, we use (9) and plug in the lower bounds on $T_i(t)$ from Claim 3.7(ab). □

Now that we have all the groundwork, let us argue that in the stochastic model the consistency conditions in the algorithm are satisfied with high probability.

Corollary 3.9. Assume the stochastic model. Then in each round t of the exploitation phase, with high 
probability the following holds: 

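\[ 16\, C_{crn}/\sqrt{\tau_*} \le \mu_1 - \mu_2 \le 32\, C_{crn}/\sqrt{\tau_*}. \tag{10} \]

Moreover, the consistency conditions of the exploitation phase are satisfied.

Proof. The consistency between the estimates $\tilde H_{i,t}$ and the algorithm's averages $\hat H_{i,t}$ follows simply by combining Claim 3.4(b) and Claim 3.8.
To obtain (10), we note that by Claim 3.4(b) and Claim 3.8 w.h.p. it holds that

\[ |\tilde H_{1,t} - \mu_1| + |\tilde H_{2,t} - \mu_2| \le 8\, C_{crn}/\sqrt{\tau_*}. \tag{11} \]

Recall that the exit condition (2) holds at time $t = \tau_* - 1$, and fails at $t = \tau_*$. This in conjunction with (11) (for $t = \tau_*$) implies (10). In turn, (10) with (11) imply the remaining consistency condition, which bounds the gap between the two estimates. □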

To complete the proof of Theorem 3.1(b), assume we are in the stochastic model with gap $\Delta = |\mu_1 - \mu_2|$. In the rest of the argument, we omit "with high probability". If the exploration phase never ends, it is easy to see that $\Delta \le O(C_{crn}/\sqrt{n})$, and we are done since trivially $R_n \le \Delta n \le O(C_{crn}^2/\Delta)$. Else, by Corollary 3.9 it holds that arm 1 is optimal, $\tau_* = \Theta\big((C_{crn}/\Delta)^2\big)$ and moreover that the exploitation phase never ends. Now, by Claim 3.3 in the exploitation phase the suboptimal arm 2 is played at most $O(\tau_* \log n)$ times. Therefore

\[ R_n \le \Delta \cdot O(\tau_* \log n) = O\!\left( \frac{C_{crn}^2 \log n}{\Delta} \right). \]



4 The SAO algorithm for the general case 

In this section we treat the general case: $K$ arms and adaptive adversary. The proposed algorithm SAO (Stochastic and Adversarial Optimal) is described precisely in Algorithm 1. At a high level, SAO proceeds similarly to the simplified version in Section 3, but there are a few key differences.

First, the exploration and exploitation phases are now interleaved. Indeed, SAO starts with all arms being "active", and then it successively "deactivates" them as they turn out to be suboptimal. Thus, the algorithm evolves from pure exploration (when all arms are activated) to pure exploitation (when all arms but the optimal one are deactivated).

Second, in order to make the above evolution smooth we adopt a more complicated (re)sampling schedule than the one we used in Section 3. Namely, the probability of selecting a given arm continuously increases while this arm stays active, and then continuously decreases when it gets deactivated, and the transition between the two phases is also continuous. For the precise equation, see Equation (16) in Algorithm 1.

Third, this more subtle behavior of the (re)sampling probabilities $p_{i,t}$ in turn necessitates more complicated consistency conditions (e.g. see Condition (13) compared to its counterpart in Section 3), and a more intricate analysis. The key in the analysis is to obtain good concentration properties of the different estimators, which we accomplish by exhibiting martingale sequences and resorting to Bernstein's inequality for martingales (Theorem 4.3).

Recall that the crucial parameter for the stochastic model is the minimal gap $\Delta = \min_{i : \Delta_i > 0} \Delta_i$, where $\Delta_i = (\max_{1 \le j \le K} \mu_j) - \mu_i$ is the gap of arm $i$. Our main result is formulated as follows:

Theorem 4.1. SAO with $\beta = n^4$ satisfies

\[ E[R_n] \le O\!\left( \frac{K \log(K) \log^2(n)}{\Delta} \right) \quad \text{in the stochastic model,} \]

\[ E[R_n] \le O\!\left( \log(K) \log^{3/2}(n) \sqrt{nK} \right) \quad \text{in the adversarial model.} \]

More precisely, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, SAO with $\beta = 10 K n^3 \delta^{-1}$ satisfies in the stochastic model:

\[ R_n \le \frac{260\, K (1 + \log K) \log^2(\beta)}{\Delta}, \]

and in the adversarial model:

\[ R_n \le 60 (1 + \log K)(1 + \log n) \sqrt{nK \log(\beta)} + 200\, K^2 \log^2(\beta). \]

Algorithm 1 The SAO strategy with parameter $\beta > 1$



$A \leftarrow \{1, \dots, K\}$    ▷ $A$ is the set of active arms
for $i = 1, \dots, K$ do    ▷ Initialization
    $\tau_i \leftarrow n$    ▷ $\tau_i$ is the time when arm $i$ is deactivated
    $p_i \leftarrow 1/K$    ▷ $p_i$ is the probability of selecting arm $i$
end for
for $t = 1, \dots, n$ do    ▷ Main loop
    Play $I_t$ at random from $p$    ▷ Selection of the arm to play
    for $i = 1, \dots, K$ do    ▷ Test of four properties for arm $i$
        if    ▷ Test if arm $i$ should be deactivated

            \[ i \in A, \text{ and } \max_{j \in A} \tilde H_{j,t} - \tilde H_{i,t} > 6 \sqrt{\frac{4K \log(\beta)}{t}} + 5 \left( \frac{K \log(\beta)}{t} \right)^2 \tag{12} \]

        then $A \leftarrow A \setminus \{i\}$, $\tau_i \leftarrow t$ and $q_i \leftarrow p_i$    ▷ Deactivation of arm $i$
        end if    ▷ $q_i$ denotes the probability of arm $i$ at the moment when it was deactivated
        if one of the three following properties is satisfied
        then Start Exp3.P with the parameters described in [Theorem 2.4, Bubeck 2010]
            ▷ Test if stochastic model still valid for arm $i$
            ▷ First, test if the two estimates of $H_{i,t}$ are consistent; let $t^* = \min(\tau_i, t)$.

            \[ |\tilde H_{i,t} - \hat H_{i,t}| > \sqrt{\frac{2 \log(\beta)}{T_i(t)}} + \sqrt{4 \left( \frac{K t^*}{t^2} + \frac{t - t^*}{q_i \tau_i t} \right) \log(\beta)} + 5 \left( \frac{K \log(\beta)}{t^*} \right)^2 \tag{13} \]

            ▷ Second, test if the estimated suboptimality of arm $i$ did not increase too much

            \[ i \notin A, \text{ and } \max_{j \in A} \tilde H_{j,t} - \tilde H_{i,t} > 10 \sqrt{\frac{4K \log(\beta)}{\tau_i - 1}} + 5 \left( \frac{K \log(\beta)}{\tau_i - 1} \right)^2 \tag{14} \]

            ▷ Third, test if arm $i$ still seems significantly suboptimal

            \[ i \notin A, \text{ and } \max_{j \in A} \tilde H_{j,t} - \tilde H_{i,t} \le 2 \sqrt{\frac{4K \log(\beta)}{\tau_i}} + 5 \left( \frac{K \log(\beta)}{\tau_i} \right)^2 \tag{15} \]

        end if
    end for    ▷ End of testing
    for $i = 1, \dots, K$ do    ▷ Update of the probability of selecting arm $i$

        \[ p_i \leftarrow \frac{q_i \tau_i}{t+1}\, 1_{i \notin A} + \frac{1}{|A|} \left( 1 - \sum_{j \notin A} \frac{q_j \tau_j}{t+1} \right) 1_{i \in A} \tag{16} \]

    end for
end for

We divide the proof into three parts. In Section 4.1 we propose several concentration inequalities for the different quantities involved in the algorithm. Then we make a deterministic argument conditional on the event that all these concentration inequalities hold true. First, in Section 4.2 we analyze stochastic rewards, and Section 4.3 concerns the adversarial rewards.

Let us discuss some notation. Recall that we denote by $p_{i,t}$ the probability that the algorithm selects arm $i$ at time $t$; this probability is denoted by $p_i$ in the description of the algorithm. As in Algorithm 1, $q_i$ will denote the probability of arm $i$ at the moment when this arm was deactivated. Let $A_t$ denote the set of active arms at the end of time step $t$. We also introduce $\tau_0$ as the last time step before we start Exp3.P, with the convention that $\tau_0 = n$ if we never start Exp3.P. Moreover note that with this notation, if $\tau_i \le \tau_0$ then we have $q_i = p_{i,\tau_i}$. We generalize this notation and set $q_i := p_{i,\min(\tau_i,\tau_0)}$. For the sake of notation, in the following $\tau_i$ denotes the minimum between the time when arm $i$ is deactivated and the last time before we start Exp3.P, that is $\tau_i \leftarrow \min(\tau_i, \tau_0)$.
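The selection probabilities discussed here can be illustrated with a short Python sketch. It assumes the following reading of the update rule: a deactivated arm $i$ is selected with probability $q_i \tau_i/(t+1)$ in the next round, and the remaining probability mass is split equally among the active arms. The function name `update_probabilities` and the 0-indexed arm layout are illustrative choices, not part of the paper.

```python
from typing import Dict, List, Set


def update_probabilities(t: int, active: Set[int], q: Dict[int, float],
                         tau: Dict[int, int], num_arms: int) -> List[float]:
    """Selection probabilities for round t+1 (a sketch under the stated assumption).

    q[i] and tau[i] are only needed for deactivated arms: q[i] is the selection
    probability of arm i when it was deactivated, tau[i] the deactivation time.
    """
    if not active:
        raise ValueError("at least one arm must remain active")
    # Deactivated arms decay like q_i * tau_i / (t + 1).
    inactive_prob = {i: q[i] * tau[i] / (t + 1)
                     for i in range(num_arms) if i not in active}
    # Active arms share whatever mass is left, equally.
    share = (1.0 - sum(inactive_prob.values())) / len(active)
    return [share if i in active else inactive_prob[i] for i in range(num_arms)]


# Three arms; arm 2 was deactivated at time 4 while it had probability 1/3.
p = update_probabilities(t=9, active={0, 1}, q={2: 1 / 3}, tau={2: 4}, num_arms=3)
```

In this example the deactivated arm keeps probability 4/30, and each active arm receives more than the initial 1/3, matching the description that an arm's probability grows while it is active and shrinks after deactivation.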



4.1 Concentration inequalities 

We start with two standard concentration inequalities for martingale sequences. 



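Theorem 4.2 (Hoeffding-Azuma's inequality for martingales, Hoeffding [1963]).
Let $\mathcal F_1 \subset \dots \subset \mathcal F_n$ be a filtration, and $X_1, \dots, X_n$ real random variables such that $X_t$ is $\mathcal F_t$-measurable, $E(X_t | \mathcal F_{t-1}) = 0$ and $X_t \in [A_t, A_t + c_t]$ where $A_t$ is a random variable $\mathcal F_{t-1}$-measurable and $c_t$ is a positive constant. Then, for any $\epsilon > 0$, we have

\[ P\left( \sum_{t=1}^{n} X_t \ge \epsilon \right) \le \exp\left( - \frac{2 \epsilon^2}{\sum_{t=1}^{n} c_t^2} \right), \tag{17} \]

or equivalently for any $\delta > 0$, with probability at least $1 - \delta$, we have

\[ \sum_{t=1}^{n} X_t \le \sqrt{ \frac{\log(\delta^{-1})}{2} \sum_{t=1}^{n} c_t^2 }. \tag{18} \]

Theorem 4.3 (Bernstein's inequality for martingales, Freedman [1975]).
Let $\mathcal F_1 \subset \dots \subset \mathcal F_n$ be a filtration, and $X_1, \dots, X_n$ real random variables such that $X_t$ is $\mathcal F_t$-measurable, $E(X_t | \mathcal F_{t-1}) = 0$, $|X_t| \le b$ for some $b > 0$ and let $V_n = \sum_{t=1}^{n} E(X_t^2 | \mathcal F_{t-1})$. Then, for any $\epsilon > 0$, we have

\[ P\left( \sum_{t=1}^{n} X_t \ge \epsilon \text{ and } V_n \le V \right) \le \exp\left( - \frac{\epsilon^2}{2V + 2b\epsilon/3} \right), \tag{19} \]

and for any $\delta > 0$, with probability at least $1 - \delta$, we have either $V_n > V$ or

\[ \sum_{t=1}^{n} X_t \le \sqrt{2 V \log(\delta^{-1})} + \frac{b \log(\delta^{-1})}{3}. \tag{20} \]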
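For intuition on how a tail bound of the Bernstein form $\exp(-\epsilon^2/(2V + 2b\epsilon/3))$ turns into a deviation radius, the Python sketch below solves $\exp(-\epsilon^2/(2V + 2b\epsilon/3)) = \delta$ exactly for $\epsilon$ (a quadratic in $\epsilon$). The function names and the numerical values are illustrative. The exact root is $\epsilon = bL/3 + \sqrt{(bL/3)^2 + 2VL}$ with $L = \log(\delta^{-1})$, and it is at most $\sqrt{2VL} + 2bL/3$.

```python
import math


def bernstein_tail(eps: float, V: float, b: float) -> float:
    """Tail bound exp(-eps^2 / (2V + 2*b*eps/3))."""
    return math.exp(-eps ** 2 / (2 * V + 2 * b * eps / 3))


def bernstein_radius(delta: float, V: float, b: float) -> float:
    """Positive root of eps^2 = L * (2V + 2*b*eps/3), where L = log(1/delta)."""
    L = math.log(1 / delta)
    return b * L / 3 + math.sqrt((b * L / 3) ** 2 + 2 * V * L)


delta, V, b = 0.05, 10.0, 1.0                     # made-up example values
eps = bernstein_radius(delta, V, b)
L = math.log(1 / delta)
simple_upper = math.sqrt(2 * V * L) + 2 * b * L / 3
```

Plugging `eps` back into `bernstein_tail` returns `delta` up to floating-point error, which confirms that the radius inverts the tail bound. When the variance proxy `V` is small, the radius scales like `b * log(1/delta)` rather than like a square root of the number of rounds, which is what makes Bernstein-type bounds useful for rarely played arms.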



Next we derive a version of Bernstein's inequality that suits our needs. 

Lemma 4.4. Let $\mathcal F_1 \subset \dots \subset \mathcal F_n$ be a filtration, and $X_1, \dots, X_n$ real random variables such that $X_t$ is $\mathcal F_t$-measurable, $E(X_t | \mathcal F_{t-1}) = 0$ and $|X_t| \le b$ for some $b > 0$. Let $V_n = \sum_{t=1}^{n} E(X_t^2 | \mathcal F_{t-1})$ and $\delta > 0$. Then with probability at least $1 - \delta$,

\[ \sum_{t=1}^{n} X_t \le \sqrt{4 V_n \log(n \delta^{-1})} + 5 b^2 \log^2(n \delta^{-1}). \]

Proof. The proof follows from Theorem 4.3 along with a union bound on the events $V_n \in [x, x + b^2]$, $x \in \{0, b^2, 2b^2, \dots, (n-1) b^2\}$. It also uses $\sqrt{a} + \sqrt{b} \le \sqrt{2(a + b)}$. □

Now let us use this martingale inequality to derive the concentration bound for (average) estimated cumulative rewards $\tilde H_{i,t}$. Recall that $\tilde H_{i,t}$ is an estimator of $H_{i,t}$, so we want to upper-bound the difference $|\tilde H_{i,t} - H_{i,t}|$, and in the stochastic model $\tilde H_{i,t}$ is an estimator of the true expected reward $\mu_i$, so we want to upper-bound the difference $|\tilde H_{i,t} - \mu_i|$.

Lemma 4.5. For any arm $i \in \{1, \dots, K\}$ and any time $t \in \{1, \dots, n\}$, in the stochastic model we have with probability at least $1 - \delta$, if $t \le \tau_0$,

h» - * < / 4 + + s W*^T 2 

V V f / V mm(rj,t) 

Moreover in the adversarial model we have with probability at least $1 - \delta$, if $t \le \tau_0$,



Hit — Ha 



< u,Kmm(n,t) + max(t-r t ,0)\ + g ^log(2f^-i; 



t 2 gjTjt / \ min(rj,t) 



Proof. The proofs of the two concentration inequalities are similar, so we restrict our attention to the adversarial model. Let $(\mathcal F_s)$ be the filtration associated to the history of the strategy. We introduce the following sequence of independent random variables: for $1 \le i \le K$, $1 \le s \le n$ and $p \in [0, 1]$, let $Z_s^i(p) \sim \text{Bernoulli}(p)$. Then for $t \le \tau_0$ we have,

V Vi,s ~ QiTi \ s J 

For T G {l,...,n}, let 

xl(T) = (%£A _ ,) w + (_i_ z ; ) _ ,) S( ,w n . 

We have, for t < tq, 

t 

Gi,t - Gi.t = y^^c(Tj). 

s=l 

Now remark that (X*(T))i< s <t is a martingale difference sequences such that |JQ(T)| < K max l) 
(since > 1/K when s < Tj) and 

t mm(n,t) . . 

E,, ,■, „, v v — v 1 rmaxit — 1,0) 
e ((x](r)) 2 i j- s _!) < y, — + — br^- 
s =i ,=i ft ' s 

Thus, using Lemma l4~4l we obtain that with probability at least 1 — 5, 



8=1 



1 



_2_ + t max(t-r,0) \ log(M . 1) + 5A , 2 ^ A log2(w _ 1; 



Then, using a union bound over $T$, we obtain the claimed inequality by taking $T = \tau_i$ (with another union bound to get the two-sided inequality). □

Next, we analyze the (average) cumulative reward $\hat H_{i,t}$ collected by the algorithm. Again, in the stochastic model $\hat H_{i,t}$ can be used as an estimate of the true expected reward $\mu_i$, and it is not hard to see that it is a reasonably sharp estimate.

Lemma 4.6. For any arm $i \in \{1, \dots, K\}$, in the stochastic model we have with probability at least $1 - \delta$, for any time $t \in \{1, \dots, n\}$, if $T_i(t) \ge 1$,

\[ |\hat H_{i,t} - \mu_i| \le \sqrt{ \frac{2 \log(2n \delta^{-1})}{T_i(t)} }. \]

Proof. This follows via a union bound over the value of $T_i(t)$ and a standard Hoeffding's inequality for independent random variables, see Theorem 4.2. □

Next we show that, essentially, $T_i(t) \le O(q_i \tau_i + \sqrt{q_i \tau_i})$.

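Lemma 4.7. For any $i \in \{1, \dots, K\}$, $t \in \{1, \dots, n\}$, with probability at least $1 - \delta$, if $t \le \tau_0$,

\[ T_i(t) \le q_i \tau_i (1 + \log t) + \sqrt{4 q_i \tau_i (1 + \log t) \log(t \delta^{-1})} + 5 \log^2(t \delta^{-1}). \]

Proof. Using the notation of the proof of Lemma 4.5, we have for $t \le \tau_0$,

\[ T_i(t) = \sum_{s=1}^{t} \left( Z_s^i(p_{i,s})\, 1_{s \le \tau_i} + Z_s^i\!\left( \frac{q_i \tau_i}{s} \right) 1_{s > \tau_i} \right). \]

Let

\[ X_s^i = \left( Z_s^i(p_{i,s}) - p_{i,s} \right) 1_{s \le \tau_i} + \left( Z_s^i\!\left( \frac{q_i \tau_i}{s} \right) - \frac{q_i \tau_i}{s} \right) 1_{s > \tau_i}. \]

Then $(X_s^i)$ is a martingale difference sequence such that $|X_s^i| \le 1$ and, since $p_{i,s}$ is increasing in $s$ for $s \le \tau_i$, it follows that

\[ \sum_{s=1}^{t} E\left( (X_s^i)^2 \,|\, \mathcal F_{s-1} \right) \le q_i \tau_i + \sum_{s=\tau_i+1}^{t} \frac{q_i \tau_i}{s} \le q_i \tau_i (1 + \log t). \]

Thus using Lemma 4.4 we obtain that with probability at least $1 - \delta$:

\[ \sum_{s=1}^{t} X_s^i \le \sqrt{4 q_i \tau_i (1 + \log t) \log(t \delta^{-1})} + 5 \log^2(t \delta^{-1}). \]

It implies that

\[ \sum_{s=1}^{t} \left( Z_s^i(p_{i,s})\, 1_{s \le \tau_i} + Z_s^i\!\left( \frac{q_i \tau_i}{s} \right) 1_{s > \tau_i} \right) \le q_i \tau_i (1 + \log t) + \sqrt{4 q_i \tau_i (1 + \log t) \log(t \delta^{-1})} + 5 \log^2(t \delta^{-1}), \]

which is the claimed inequality. □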



The next lemma restates the regret guarantee for Exp3.P in terms of our setting. Instead of using the original guarantee from Auer et al. [2002b], we take an improved bound from Bubeck [2010] (namely, Theorem 2.4 in Bubeck [2010]).

Lemma 4.8. In the adversarial model, with probability at least $1 - \delta$, we have

\[ \max_{i \in \{1, \dots, K\}} \sum_{t=\tau_0+1}^{n} g_{i,t} - \sum_{t=\tau_0+1}^{n} g_{I_t,t} \le 5.15 \sqrt{(n - \tau_0) K \log(K \delta^{-1})}. \]

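Let $\beta = 10 K n^3 \delta^{-1}$. Putting together the results of Lemmas 4.5, 4.6, 4.7 and 4.8 we obtain that with probability at least $1 - \delta$, the following inequalities hold true for any arm $i \in \{1, \dots, K\}$ and any time $t \in \{1, \dots, \tau_0\}$:

In the stochastic model,

\[ |\tilde H_{i,t} - \mu_i| \le \sqrt{4 \left( \frac{K \min(\tau_i, t)}{t^2} + \frac{\max(t - \tau_i, 0)}{q_i \tau_i t} \right) \log(\beta)} + 5 \left( \frac{K \log(\beta)}{\min(\tau_i, t)} \right)^2. \tag{21} \]

In the adversarial model,

\[ |\tilde H_{i,t} - H_{i,t}| \le \sqrt{4 \left( \frac{K \min(\tau_i, t)}{t^2} + \frac{\max(t - \tau_i, 0)}{q_i \tau_i t} \right) \log(\beta)} + 5 \left( \frac{K \log(\beta)}{\min(\tau_i, t)} \right)^2. \tag{22} \]

In the stochastic model,

\[ |\hat H_{i,t} - \mu_i| \le \sqrt{ \frac{2 \log(\beta)}{T_i(t)} }. \tag{23} \]

In both models,

\[ T_i(t) \le q_i \tau_i (1 + \log t) + \sqrt{4 q_i \tau_i (1 + \log t) \log(\beta)} + 5 \log^2(\beta). \tag{24} \]

In the adversarial model,

\[ \max_{i \in \{1, \dots, K\}} \sum_{t=\tau_0+1}^{n} g_{i,t} - \sum_{t=\tau_0+1}^{n} g_{I_t,t} \le 5.15 \sqrt{(n - \tau_0) K \log(\beta)}. \tag{25} \]

We will now make a deterministic reasoning on the event that the above inequalities are indeed true.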



4.2 Analysis in the stochastic model 



First note that by equations (21) and (23), test (13) is never satisfied.

Let $i^* \in \arg\max_i \mu_i$. Remark that by equation (21), test (12) is never satisfied for $i^*$, since if $i, i^* \in A_t$ then

-A, + 2,/i™ + 5 



i V i 



Thus we have $i^* \in A_t$, $\forall t$. Moreover if $i \notin A_t$, then it means that $\tau_i \le t$ and test (12) was satisfied at time step $\tau_i$ (and not satisfied at time $\tau_i - 1$). Thus, using (21), we see that if $i \notin A_t$ then it implies:



Aj + 2 J 4Jn ° g(/3) + 5 f ^l) 2 > eJ^g^ + 5 ^ log(/3)X 2 



and (since $i^* \in A_t$)



a, - 2, hJ^m + 5 f j£w> y < ,tep + 5 /g^v, (26) 



r,; - 1 V T,; — 1 / \ Ti - 1 V Ti - 1
Thus test (14) is never satisfied since:



max Hj i - 4, « < A, + J + 5 Y < 10, ^2§W + s (5*)' 

Moreover (15) is also never satisfied, indeed since $i^* \in A_t$ we have:



mag H lf - H„, > A, - J^M + 5 (^W) 2 > 2j 4K ' T ° g( ' 8) 4- 5 ) ". 

In conclusion we proved that Exp3.P is never started in the stochastic model, that is $\tau_0 = n$. Thus, using (24), we obtain:

\[ R_n = \sum_{i=1}^{K} \Delta_i T_i(n) \le \sum_{i=1}^{K} \Delta_i \left( q_i \tau_i (1 + \log n) + \sqrt{4 q_i \tau_i (1 + \log n) \log(\beta)} + 5 \log^2(\beta) \right). \]
Now remark that for any arm $i$ with $\Delta_i > 0$, one can see that (26) implies:

Tj < 259 t-s h 1 < 260 —5 . 

A 2 ~~ A 2 

Indeed if n > 259 X1 ° g 2 (/3) + 1, then 



f Tj-l V r i - 1 / v 259 259 

which contradicts (26).

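The proof is concluded with straightforward computations and by showing that

\[ \sum_{i=1}^{K} q_i \le 1 + \log K. \tag{27} \]

Denote by $\tau_{(1)} \le \dots \le \tau_{(K)}$ the ordered random variables $\tau_1, \dots, \tau_K$. When the $i$-th arm in this order is deactivated there are still $K - i + 1$ active arms, so we clearly have $q_{(i)} \le \frac{1}{K - i + 1}$, which proves (27).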



4.3 Analysis in the adversarial model 

Let j* e argmax 1< j </( Gj jT0 _i. First we show that i* £ A To -i- Let/* G argmax^g^ 1 Gj iT0 _i and 
z A^-i, then we have, by r* < ro — 1, (|22l) and since (fT5T > is not satisfied at time ro — 1: 

Gi* jTQ —i — Gj )To _i 



> _./4 ( *2L_ + *Zlz£) l„g(« + 5 Y - + 5 ^ ™^ 



(r - l) 2 qiTi(T -l)J V. Tj / V r o ~ 1 V r o - 1 



+ 2 UKlogtf) 5 {Klog(py* 



> _, /4 ( + ^IZZ* ) log(/3 ) + 5 (™ V + + 5 

(r - l) 2 qiTi(r Q -l)J \ n ) V r.
where the last inequality follows from $q_{i^*} \ge 1/K$ and

\[ \frac{\tau_{i^*}}{(\tau_0 - 1)^2} + \frac{\tau_0 - 1 - \tau_{i^*}}{\tau_{i^*} (\tau_0 - 1)} \le \frac{1}{\tau_{i^*}}. \]

This proves $i^* \in A_{\tau_0 - 1}$. Thus we get, using the fact that (13) and (14) are not satisfied at time $\tau_0 - 1$, as well as (22), and the fact that (12) is not satisfied for active arms at time $\tau_0 - 1$,
k 

Rtq—1 — Gi* fT Q— 1 ^ ^ Gj T0 —i 
1=1 

K 



i=l 
A 



— ^ ^i(ro — 1) ^i/j* iT0 _i — fl"i* jTo _i + _ffj* :To _i — Hi jTo -i + £T 
i=i 



i,ro— 1 Hi,To—l 



21og(/3) 



Ti(r - 1) 



Then, using (24) and (25) we get, thanks to $\tau_i \ge 2$,



< 1 + 6.6y/nKlog(P) + 12^^(1 + log n) ^16^ log(/3) + 20(Klog(/3)) 5 



i=l 

A" 



+ 12g (4 m (l + l= g »)kg<0 + 61o g »<fl) (^f^f + 5(^1) 



2^ 



< 60(1 + log K) (1 + log n) yj nK log(/3) + if 2 log 2 (fi) + 200K 2 log 2 (/5) , 
where the last inequality follows from (27) and straightforward computations.
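To get a feel for the size of the resulting guarantees, the following Python sketch evaluates the two high-probability bounds of Theorem 4.1 for given problem parameters. It assumes the bounds are read as $R_n \le 260 K (1 + \log K) \log^2(\beta)/\Delta$ in the stochastic model and $R_n \le 60 (1 + \log K)(1 + \log n)\sqrt{nK\log(\beta)} + 200 K^2 \log^2(\beta)$ in the adversarial model, with $\beta = 10 K n^3 \delta^{-1}$ and natural logarithms; this reading of the constants, the function name and the example values are assumptions of the sketch.

```python
import math
from typing import Tuple


def sao_high_probability_bounds(n: int, K: int, delta: float,
                                gap: float) -> Tuple[float, float, float]:
    """Return (beta, stochastic_bound, adversarial_bound) under the assumed reading."""
    beta = 10 * K * n ** 3 / delta
    log_beta = math.log(beta)
    # Stochastic model: polylogarithmic in n, inversely proportional to the gap.
    stochastic = 260 * K * (1 + math.log(K)) * log_beta ** 2 / gap
    # Adversarial model: square-root dependence on n, up to logarithmic factors.
    adversarial = (60 * (1 + math.log(K)) * (1 + math.log(n))
                   * math.sqrt(n * K * log_beta)
                   + 200 * K ** 2 * log_beta ** 2)
    return beta, stochastic, adversarial


beta, stochastic, adversarial = sao_high_probability_bounds(
    n=1000, K=2, delta=0.1, gap=0.2)
_, stochastic_small_gap, _ = sao_high_probability_bounds(
    n=1000, K=2, delta=0.1, gap=0.1)
```

Because of the large explicit constants, these expressions only drop below the trivial bound $R_n \le n$ for long horizons; their interest lies in the growth rates in $n$, $K$ and $\Delta$.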

Acknowledgements. We thank Peter Auer for insightful discussions.

A Concentration inequalities

Recall that the analysis in Section 3 relies on Chernoff Bounds as stated in Theorem 3.2. Let us derive Theorem 3.2 from a version of Chernoff Bounds that can be found in the literature.



Theorem A.1 (Chernoff Bounds: Theorem 2.3 in McDiarmid [1998]). Consider $n$ i.i.d. random variables
X\ . . . X n on [0, 1]. Let X = — Ylt=i be their average, and let p = E[X]. Then for any e > the 
following two properties hold: 

e*n/3 e < ! 



(a, Pr[X > (1 + ,)„] < exp (-^) < ' • ^ 



(b) Pv[X < (1 - e)p] < e~ £ "/ 2 . 
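Corollary A.2. In the setting of Theorem A.1, for any $\beta > 0$ we have:

\[ \Pr\left[ |X - \mu| > \beta \max(\beta, \sqrt{\mu}) \right] \le 2 e^{-\beta^2/3}. \tag{28} \]

We obtain Theorem 3.2 by taking $\beta = \sqrt{C}$, noting that $\beta \max(\beta, \sqrt{\mu}) \le C \max(1, \sqrt{\mu})$ for $C \ge 1$.

Proof. Fix $\beta > 0$ and consider two cases: $\mu \ge \beta^2$ and $\mu < \beta^2$.

If $\mu \ge \beta^2$ then we can take $\epsilon = \beta/\sqrt{\mu} \le 1$ in Theorem A.1(ab) and obtain

\[ \Pr\left[ |X - \mu| > \beta \sqrt{\mu} \right] = \Pr\left[ |X - \mu| > \epsilon \mu \right] \le 2 e^{-\epsilon^2 \mu / 3} = 2 e^{-\beta^2/3}. \]

Now assume $\mu < \beta^2$. We can still take $\epsilon = \beta/\sqrt{\mu}$ in Theorem A.1(b) to obtain

\[ \Pr\left[ X - \mu < -\beta^2 \right] \le \Pr\left[ X - \mu < -\beta \sqrt{\mu} \right] \le e^{-\epsilon^2 \mu / 2} = e^{-\beta^2/2}. \]

Then let us take $\epsilon = \beta^2/\mu > 1$ in Theorem A.1(a) to obtain

\[ \Pr\left[ X - \mu > \beta^2 \right] = \Pr\left[ X - \mu > \epsilon \mu \right] \le e^{-\epsilon \mu / 3} = e^{-\beta^2/3}. \]

It follows that $\Pr\left[ |X - \mu| > \beta^2 \right] \le 2 e^{-\beta^2/3}$, completing the proof. □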






